library(tidyverse)
library(ggrepel)Notebook 09 – Solutions
DSST289: Introduction to Data Science
1 Getting Started
Before running this notebook, select Session > Restart R and Clear Output.
After starting a new session, run the following code to load the libraries and data we’ll use today.
2 Flights
2.1 Datasets
In this notebook, we’ll work with several data tables about commercial flights that departed from Richmond International Airport (RIC) in 2019. The data are in five tables that can be joined using different keys. Most variables are self-explanatory; ask if you have any questions.
There is a table of airlines:
airlines <- read_csv("../data/flightsrva_airlines.csv.gz")
airlinesA table of airports:
airports <- read_csv("../data/flightsrva_airports.csv.gz")
airportsA table of flights:
flights <- read_csv("../data/flightsrva_flights.csv.gz") |>
select(-time_hour) # removing for now until we discuss datetime objects
flightsA table of planes:
planes <- read_csv("../data/flightsrva_planes.csv.gz")
planesAnd a table of weather:
weather <- read_csv("../data/flightsrva_weather.csv.gz") |>
select(-time_hour) # removing for now until we discuss datetime objects
weatherBefore you proceed, take a moment to determine what (if any) is the primary key of each table.
Don’t forget that you can click the table names in your Environment window (top-right quadrant of RStudio) to see the tables.
2.2 Carriers
In the first question, create a barplot showing the number of flights each carrier made in 2019 out of RIC. Order the barplot in descending order of the count of flights. Put the count on the x-axis and the airline on the y-axis.
This line may help: mutate(carrier = fct_inorder(carrier))
The plot above is hard to read because you may not know all the airline carrier codes.
Repeat the plot using the full airline names on the y-axis instead of carrier codes. Use a left_join with the airlines dataset to obtain the full names.
flights |>
group_by(carrier) |>
summarize(n = n()) |>
arrange(desc(n)) |>
left_join(airlines, by = "carrier") |>
mutate(name = fct_inorder(name)) |>
ggplot(aes(n, name)) +
geom_col()Do you recognize the carrier with the most flights out of the city?
2.3 Destinations
Now, produce a similar barplot counting the number of outbound flights from RIC based on the flight’s destination. Put the destination code on the y-axis, counts on the x-axis, and order in descending order of count.
flights |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n)) |>
mutate(dest = fct_inorder(dest)) |>
ggplot(aes(n, dest)) +
geom_col()Airport codes are a bit easier to figure out than airline codes, but you might still only know the top few from the previous plot. Modify your plot to use the full airport names on the y-axis instead of destination codes. Join with the airports data to obtain the full names.
Be careful with this; it is slightly more difficult than the join you did in question 2. You will need to use the faa column. “FAA” stands for the Federal Aviation Administration.
flights |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n)) |>
left_join(airports, by = c("dest" = "faa")) |>
mutate(name = fct_inorder(name)) |>
ggplot(aes(n, name)) +
geom_col()The airports table has longitude and latitude in addition to the airport names. Let’s create a plot with longitude on the x-axis, latitude on the y-axis, and a point for each airport that was a flight destination from RIC in 2019. Make the size of the points scale with the number of flights and add text labels to give the airport FAA code for each city. Finally, use labs to add proper labels to the aesthetics and a title.
flights |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n)) |>
left_join(airports, by = c("dest" = "faa")) |>
mutate(name = fct_inorder(name)) |>
ggplot(aes(lon, lat)) +
geom_point(aes(size = n)) +
geom_text_repel(aes(label = dest)) +
scale_size_area() +
labs(
x = "", y = "", size = "Number of Flights",
title = "Number of Outbound Flights from RIC (2019)"
)Let’s add one more element to the plot above. In the code below, modify the previous plot by including a large red dot showing RIC airport.
ric <- airports |>
filter(faa == "RIC")
flights |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n)) |>
left_join(airports, by = c("dest" = "faa")) |>
mutate(name = fct_inorder(name)) |>
ggplot(aes(lon, lat)) +
geom_point(aes(size = n)) +
geom_text_repel(aes(label = dest)) +
geom_point(data = ric, color = "red", size = 3) +
geom_text_repel(data = ric, aes(label = faa), color = "red") +
scale_size_area() +
labs(
x = "Latitude", y = "Longitude", size = "Number of Flights",
title = "Number of Outbound Flights from RIC (2019)"
)2.4 Planes
Take the flights table and join to the planes table using the key tailnum. Look at the resulting data, especially the name of the first column, and compare it to the first column in the original flights table.
flights |>
left_join(planes, by = "tailnum") |>
select(tailnum, year.x, year.y, manufacturer)Both flights and planes have a column called year. When this happens, by default, they will both be included in the join with names following the pattern year.x and year.y.
In these datasets, year means different things. In flights, year is the year of the flight; in planes, it’s the year the plane was built.
Repeat the last question but add the option suffix = c("", "_plane") to the left_join function. Look at the new column names and see how this solved the problem.
flights |>
left_join(planes, by = "tailnum", suffix = c("", "_plane")) |>
select(tailnum, year, year_plane, manufacturer)Using the joined data, create a bar plot showing the number of flights from RIC for each plane’s manufacturing year (year_plane). Place year_plane on the x-axis in chronological order (do not reorder by frequency). Do you notice any patterns?
flights |>
left_join(planes, by = "tailnum", suffix = c("", "_plane")) |>
group_by(year_plane) |>
summarize(n = n()) |>
ggplot(aes(year_plane, n)) +
geom_col()Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_col()`).
2.5 Weather
Join the flights and weather tables below. You’ll need 5 columns to properly define the join.
flights |>
left_join(weather, by = c("year", "month", "day", "hour", "origin"))2.6 cut()
The cut function takes two arguments: a numeric variable and the number of buckets. It converts numeric values into ordered categories by binning them into equally sized intervals.
Next, join the flights and weather tables, and create a new variable that cuts wind_gust into 10 buckets. Take a moment to examine and understand the output.
flights |>
left_join(weather, by = c("year", "month", "day", "hour", "origin")) |>
mutate(wind_gust_cat = cut(wind_gust, 10))Starting with your code above, group the data by wind gust buckets. Then compute the average flight delay and the number of flights in each bucket (remember to remove missing values in the mean function). Create a scatter plot showing average delay on the x-axis, wind categories on the y-axis, and use point size to represent the number of flights.
flights |>
left_join(weather, by = c(
"year", "month",
"day", "hour", "origin"
)) |>
mutate(wind_gust_cat = cut(wind_gust, 10)) |>
group_by(wind_gust_cat) |>
summarize(avg_delay = mean(dep_delay, na.rm = TRUE), n = n()) |>
ggplot(aes(avg_delay, wind_gust_cat)) +
geom_point(aes(size = n))Does there appear to be a relationship between high wind speeds and departure delays?
2.7 Big Planes
To finish, let’s see how we can make use of a filter join. Create a table called big_planes that contains only planes with more than 200 seats.
big_planes <- planes |>
filter(seats > 200)Repeat the code from question 4 to create a bar plot showing the number of flights to each destination, but only include flights taken by planes with more than 200 seats. Use a filter join with big_planes to filter the flights.
flights |>
semi_join(big_planes, by = "tailnum") |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n)) |>
left_join(airports, by = c("dest" = "faa")) |>
mutate(name = fct_inorder(name)) |>
ggplot(aes(n, name)) +
geom_col()Where are all of the big that depart from RIC going?
Nearly every larger plane in the dataset is going to Atlanta.